Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop
Abstract
Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization and citation matching of scientific publications. The size of the input data places these algorithms among big data problems, which can be solved efficiently on Hadoop clusters.
Similar resources
Feedback - Study and Improvement of the Random Forest of the Mahout library in the context of marketing data of Orange
In the realm of Big Data systems, Hadoop has emerged as one of the most popular systems and a very diverse ecosystem has grown around it, meeting all kinds of functional and technical needs. One niche that should have been a place of choice in this ecosystem is data analytics: first because getting value out of large datasets requires efficient Machine Learning (ML) algorithms, second because l...
Object-Tagged RBAC Model for the Hadoop Ecosystem
The Hadoop ecosystem provides a highly scalable, fault-tolerant and cost-effective platform for storing and analyzing a variety of data formats. Apache Ranger and Apache Sentry are the two predominant frameworks used to provide authorization capabilities in the Hadoop ecosystem. In this paper we present a formal multi-layer access control model (called HeAC) for the Hadoop ecosystem, as an academic-style abstrac...
A BigBench Implementation in the Hadoop Ecosystem
BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system and the queries were formulated in the proprietary SQL-MR query language. To test oth...
Hadoop Block Placement Policy for Different File Formats
Nowadays, petabytes of data have become the norm in industry. Handling and analyzing such big data is a challenging task. Even though frameworks like Hadoop (an open-source implementation of the MapReduce paradigm) and NoSQL databases like Cassandra and HBase can be used to analyze and store such large data, heterogeneity of data is still an issue. Data centers usually have clusters formed using heterogeneous nodes...
Apache Pig's Optimizer
Apache Pig allows users to describe dataflows to be executed in Apache Hadoop. The distributed nature of Hadoop, as well as its execution paradigms, provide many execution opportunities as well as impose constraints on the system. Given these opportunities and constraints Pig must make decisions about how to optimize the execution of user scripts. This paper covers some of those optimization ch...
Journal title:
- CoRR
Volume abs/1303.5367
Publication date 2013